IEEE Transactions on Computational Biology and Bioinformatics
● Institute of Electrical and Electronics Engineers (IEEE)
Preprints posted in the last 30 days, ranked by how well they match the content profile of IEEE Transactions on Computational Biology and Bioinformatics, based on 17 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is an above-average fit.
Huseynov, R.; Otlu, B.
Somatic mutations can alter normal cells and lead to cancer development. Yet distinguishing functional driver mutations from neutral passenger mutations remains a significant challenge. Traditional genomic tools often prioritize linear overlap searches, failing to capture the complex, three-dimensional regulatory environment of the genome. We present a graph-based framework, MutationNetwork, for constructing mutation-centric networks by integrating long-range intrachromosomal interactions with local genomic overlaps. Our method utilizes a unique positive and negative indexing scheme to represent interacting genomic intervals as nodes. By encoding both interactions and overlaps as edges, we enable constant-time retrieval of complex relationship data. By iteratively expanding the graph from a seed mutation, we can quantify a mutation's influence on the genomic landscape and assess its proximity to genes. We applied this framework to a dataset of 560 breast cancer whole-genome sequences, focusing on Triple-Negative Breast Cancer (TNBC) and Luminal A subtypes. Our results demonstrate that the generated mutation embeddings successfully cluster samples according to their biological subtypes, with the highest classification performance achieved at specific ranges. This approach provides a comprehensive view of mutation impact, offering a scalable solution for cancer patient stratification and the prioritization of potential non-coding driver mutations by assessing their network-level impact.
Availability and implementation: The source code is available at https://github.com/Ramalh/MutationNetwork
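The signed-indexing idea described above can be sketched in a few lines. Everything below (the edge lists, the interval ids, the hop limit) is a hypothetical toy for illustration, not MutationNetwork's actual data model:

```python
from collections import deque

# Positive node ids represent genomic intervals on the "anchor" side of a
# long-range interaction; the matching negative id represents the partner
# interval, so -i is recoverable from i in constant time. These edge lists
# are invented toy data.
edges = {
    1: [-1, 2],   # interval 1 interacts with its partner (-1) and overlaps interval 2
    -1: [1, 3],   # the partner interval overlaps interval 3 (say, a gene)
    2: [1],
    3: [-1],
}

def expand_from_seed(seed, max_hops):
    """Iteratively expand the graph from a seed mutation's interval,
    returning every reached node with its hop distance (breadth-first)."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue
        for nbr in edges.get(node, []):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

reach = expand_from_seed(1, max_hops=2)  # → {1: 0, -1: 1, 2: 1, 3: 2}
```

The hop distances from the seed could then feed into an embedding or a gene-proximity score, in the spirit of the expansion described in the abstract.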
Colangelo, G.; Marti, M.
The space of possible phenotype profiles over the Human Phenotype Ontology (HPO) is combinatorially vast, whereas the space of candidate disease genes is far smaller. Phenotype-driven diagnosis is therefore highly non-bijective: many distinct symptom profiles can correspond to the same gene, but only a small fraction of the theoretical phenotype space is biologically and clinically plausible. When a structured ontology exists, this constraint can be exploited to generate realistic synthetic cases. We introduce GraPhens, a simulation framework that uses gene-local HPO structure together with two empirically motivated soft priors, over the number of observed phenotypes per case and phenotype specificity, to generate synthetic phenotype-gene pairs that are novel yet clinically plausible. We use these synthetic cases to train GenPhenia, a graph neural network that reasons over patient-specific phenotype subgraphs rather than flat phenotype sets. Despite being trained entirely on synthetic data, GenPhenia generalizes to real, previously unseen clinical cases and outperforms existing phenotype-driven gene-prioritization methods on two real-world datasets. These results show that when patient-level data are scarce but a structured ontology is available, principled simulation can provide effective training data for end-to-end neural diagnosis models.
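Ontology-guided simulation with soft priors, as described above, can be illustrated with a toy sketch. The mini-ontology, gene annotations, and prior parameters here are all invented for illustration and are not taken from GraPhens:

```python
import random

# Toy ontology: child -> parent; depth serves as a crude specificity proxy.
parent = {"HP:B": "HP:A", "HP:C": "HP:A", "HP:D": "HP:B", "HP:E": "HP:B"}

def depth(term):
    d = 0
    while term in parent:
        term = parent[term]
        d += 1
    return d

# Gene-local annotations (toy): terms plausibly associated with the gene.
gene_terms = {"GENE1": ["HP:B", "HP:C", "HP:D", "HP:E"]}

def synthesize_case(gene, rng):
    terms = gene_terms[gene]
    # Soft prior over the number of observed phenotypes per case.
    n = min(len(terms), max(1, int(rng.gauss(2.5, 1.0))))
    # Soft prior favouring more specific (deeper) terms.
    weights = [1 + depth(t) for t in terms]
    return gene, sorted(set(rng.choices(terms, weights=weights, k=n)))

rng = random.Random(0)
case = synthesize_case("GENE1", rng)
```

Sampling many such (phenotypes, gene) pairs yields novel-yet-plausible training cases, which is the role the synthetic corpus plays for GenPhenia.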
Nguyen, H.; Li, C.; Peng, C.; Simpson, P.; Ye, N.; Nguyen, Q.
Foundation models for computational pathology have rapidly emerged as powerful tools for extracting rich biological and morphological representations from histopathology images. However, variations in model architecture, pre-training data, and optimization objectives often lead to task-dependent performance, rather than universal generalization. As a result, effective strategies for integrating their complementary strengths are essential to fully realize the potential of foundation models for robust histopathology analysis. Meanwhile, recent breakthroughs such as spatial transcriptomics provide an unprecedented opportunity to integrate genetic and histopathology information from the same patient sample, thereby maximizing both molecular and anatomical pathology insights. Here, we propose an adaptive ensemble framework that integrates multiple pathology foundation models for spatial gene-expression prediction. Specifically, each model's embedding is first mapped to gene-level predictions via a dedicated prediction head, enabling model-specific feature utilization. A lightweight weighting network then adaptively aggregates these predictions to produce a unified and robust output at the gene and spatial location levels. Across multiple spatial transcriptomics datasets, our approach consistently outperforms both individual foundation models and classical ensembling methods. Focusing on breast cancer, we observe substantial gains in prediction accuracy for clinically relevant PAM50 subtype markers and drug-target genes. Moreover, the proposed framework improves interpretability by revealing model-specific contributions and specialization at the gene level. Overall, our work presents an effective solution for integrating multiple foundation models to enhance the genetic analysis of histopathology images.
Zhou, M.; Zhang, M.; Wang, J.; Shao, C.; Yan, G.
Cardiovascular disease is one of the leading causes of death worldwide, with myocardial infarction (MI) being a major cause of both morbidity and mortality among cardiovascular patients. MI patients face a higher risk of cardiovascular disease recurrence afterwards. Therefore, accurately predicting the risk of recurrence and identifying key risk factors are crucial for clinical decision-making. In this paper, we consider the interrelationships among cardiovascular factors from a systemic perspective. We first construct a differential network for each patient to capture individual-specific deviations in factor relationships and propose a novel method, termed Causal Factor-aware Graph Neural Network (CFGNN), which integrates factor interactions to predict the recurrence risk of MI patients while uncovering key risk factors from a causal perspective. Experimental results demonstrate that CFGNN performs well on real-world hospital-derived datasets, effectively identifying several key risk factors. This method not only deepens our understanding of cardiovascular disease but also paves the way for more targeted and effective interventions.
Fletcher, W. L.; Sinha, S.
The practice of identifying biomarkers and developing prognostic models using genomic data has become increasingly prevalent. Such data often feature characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance on diverse right-censored time-to-event data (also known as survival data) is much needed. We have compiled many existing methods, including machine learning methods, several of which have performed well in previous benchmarks, primarily to compare their variable selection capability, and secondarily their survival time prediction, on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses with these methods on a publicly available and widely used cancer cohort from The Cancer Genome Atlas. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well on all metrics, and the LASSO and elastic net excelled on concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. Some methods' performances were greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers choose the best approach for their needs when working with genomic data.
Haque, N.; Mazed, A.; Ankhi, J. N.; Uddin, M. J.
Accurate classification of SARS-CoV-2 genomic variants is essential for effective genomic surveillance, yet it is challenged by extreme class imbalance, limited representation of rare variants, and distribution shifts in real-world sequencing data. In this study, we employ a hybrid RF-SVM framework designed for robust detection of rare SARS-CoV-2 variants. It integrates a random forest and a polynomial-kernel support vector machine to enhance sensitivity to minority classes while maintaining overall predictive stability. We systematically compared classical machine learning models, deep learning approaches, and hybrid strategies under both standard and distribution-shifted evaluation settings. Our results show that classical models using TF-IDF-based k-mer features outperform deep learning methods on macro-averaged performance metrics. The Random Forest classifier using TF-IDF features achieved the best overall performance, with a macro-averaged F1-score of 0.8894 and an accuracy of 96.3%. The model also demonstrated strong generalization ability, as evidenced by stable cross-validation performance (CV accuracy = 0.9637). The hybrid RF-SVM model further improves rare variant detection under severe class imbalance. Calibration analysis indicates reliable probability estimates for common variants, although challenges persist for minority classes. Overall, this study highlights the limitations of deep learning in highly imbalanced genomic settings and demonstrates that carefully designed hybrid machine learning approaches provide an effective and interpretable solution for rare SARS-CoV-2 variant detection.
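The TF-IDF-over-k-mers featurization described above can be sketched by hand. This is a minimal illustration (a production pipeline would use an optimized vectorizer such as scikit-learn's, and the sequences here are toy examples):

```python
import math
from collections import Counter

def kmers(seq, k=3):
    """Overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def tfidf_matrix(seqs, k=3):
    """Rows are sequences, columns are k-mers, entries are tf * idf."""
    docs = [Counter(kmers(s, k)) for s in seqs]
    vocab = sorted({km for d in docs for km in d})
    n = len(docs)
    # Smoothed idf: rarer k-mers get larger weights.
    idf = {km: math.log(n / sum(1 for d in docs if km in d)) + 1 for km in vocab}
    rows = []
    for d in docs:
        total = sum(d.values())
        rows.append([d[km] / total * idf[km] for km in vocab])
    return rows, vocab

seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
X, vocab = tfidf_matrix(seqs)
```

The resulting matrix `X` can be fed to any classifier (e.g., a random forest or an SVM), which is the role TF-IDF k-mer features play in the pipeline the abstract describes.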
Bhati, U.; Gupta, S.; kesarwani, V.; Shankar, R.
Protein-protein interactions (PPIs) are molecular Lego pieces that define the physical states of cells. Accurately identifying PPIs remains challenging due to the interplay of several factors ranging from electrostatics to molecular geometry, topology, and physics. Existing computational approaches capture only fragments of this orchestra, limiting their generalizability across protein families and interaction types. Here, we present ProMaya, a hierarchical multi-scale graph-transformer framework that integrates 3D atomic geometry, electronic distribution, residue-level structure and disorder, surface mass-density signatures, and large protein language-model embeddings of interacting proteins. Comprehensively benchmarked across nine species and 47 GB of experimentally validated data, ProMaya achieved consistently >95% average accuracy, outperforming state-of-the-art tools by >12%. Explainability analysis shows that the newly introduced atomic-level and protein language-model information drives this performance, making ProMaya a powerful tool for PPI discovery in any species, with the potential to bypass costly experiments. The ProMaya system is freely accessible at https://scbb.ihbt.res.in/ProMaya/
Liang, L.; Zhao, K.
Accurate quality assessment of predicted protein-protein complex structures remains a major challenge. Existing graph-based quality assessment methods often treat the entire complex as a homogeneous graph, which obscures the physical distinction between intra-chain folding stability and inter-chain binding specificity. In this study, we introduce TriGraphQA, a novel triple graph learning framework designed for model quality assessment of protein complexes. TriGraphQA explicitly decouples monomeric and interfacial representations by constructing three geometric views: two residue-node graphs capturing the local folding environments of individual chains, and a dedicated contact-node graph representing the binding interface. Crucially, we propose an interface context aggregation module to project context-rich embeddings from the monomers onto the interface, effectively fusing multi-scale structural features. We conducted comprehensive tests on several challenging benchmark datasets, including Dimer50, DBM55-AF2, and HAF2. The results show that TriGraphQA significantly outperforms state-of-the-art single-model methods. TriGraphQA consistently achieves the highest global scoring correlations and lower top-ranking losses. Consequently, TriGraphQA provides a powerful evaluation tool for protein-protein docking, facilitating the reliable identification of near-native assemblies in large-scale structural modeling and molecular recognition studies.
Heaton, H.; Behboudi, R.; Ward, C.; Weerakoon, M.; Kanaan, S.; Reichle, S.; Hunter, N.; Furlan, S.
Rare, genetically distinct cells can occur in various samples, such as transplant patients, naturally occurring microchimerism between maternal and fetal tissues, and cancer samples with sufficient mutational burden. Computational methods for detecting these foreign cells are vital to studying these biological conditions. An application of particular interest is that of leukemia patients post hematopoietic cell transplant (HCT). In many leukemias, a primary therapy is HCT, after which the primary genotype of the bone marrow and blood cells should be of donor origin. If cells exist that are of the patient's genotype and of the cell-type lineage of the particular leukemia, this is known as measurable residual disease (MRD). If the MRD is high enough, this may represent a relapse of the patient's leukemia. Furthermore, accurately estimating the MRD is important for driving clinical decision-making for these patients. Here we present Cellector, a computational method for identifying rare foreign-genotype cells in single-cell RNAseq (scRNAseq) datasets. We show that Cellector accurately detects microchimeric cells down to an exceedingly low percentage of these cells present (0.05% or lower).
Li, Y.; Zhao, Q.; Wang, J.
Expert pharmaceutical chemists interpret molecular structures through a sophisticated cognitive hierarchy, transitioning from local functional moieties to spatial pharmacophores and, ultimately, to macroscopic pharmacological and physicochemical profiles. However, conventional Graph Neural Networks frequently overlook this high-level chemical intuition by treating molecules as single-scale atomic topology. To bridge this gap between human expertise and computational inference, we propose PyrMol, a knowledge-structured pyramid representation learning framework. By constructing heterogeneous hierarchical graphs, PyrMol orchestrates information flow across atomic, subgraph, and molecular levels. Crucially, the subgraph level systematically integrates three complementary expert views comprising functional groups, pharmacophores, and retrosynthetic fragments. To harmonize these explicit domain priors with implicit computational semantics, we introduce an adaptive Multi-source Knowledge Enhancement and Fusion module that dynamically balances their complementarity and redundancy. A Hierarchical Contrastive Learning strategy further ensures cross-scale semantic consistency. Empirical evaluations across ten benchmark datasets demonstrate that PyrMol outperforms 12 state-of-the-art baselines. Furthermore, its "plug-and-play" versatility provides a framework-agnostic performance boost for existing GNN architectures. PyrMol thus establishes a principled data-knowledge dual-driven paradigm for AI-aided Drug Discovery, effectively leveraging domain knowledge to catalyze advances in molecular property prediction.
Fan, J.; Liu, F.; Lai, X.
Accurate cell type identification is a fundamental step in single-cell RNA sequencing (scRNA-seq) data analysis, providing critical insights into cellular heterogeneity at high resolution. However, the high dimensionality, zero inflation, and long-tailed distributions of scRNA-seq data pose significant computational challenges for conventional clustering approaches. Although recent deep learning-based methods utilize contrastive learning to jointly learn representations and clustering assignments, they often overlook cluster-level information, leading to suboptimal feature extraction for downstream tasks. To address these limitations, we propose scRGCL, a single-cell clustering method that learns a regularized representation guided by contrastive learning. Specifically, scRGCL captures the cell-type-associated expression structure by clustering similar cells together while ensuring consistency. For each sample, the model performs negative sampling by selecting cells from distinct clusters, thereby ensuring semantic dissimilarity between the target cell and its negative pairs. Moreover, scRGCL introduces a neighbor-aware re-weighting strategy that increases the contribution of samples from clusters closely related to the target. This mechanism prevents cells from the same category from being mistakenly pushed apart, effectively preserving intra-cluster compactness. Extensive experiments on fourteen public datasets demonstrate that scRGCL consistently outperforms state-of-the-art methods, as evidenced by significant improvements in normalized mutual information (NMI) and adjusted Rand index (ARI). Moreover, ablation studies confirm that the integration of cluster-aware negative sampling and the neighbor-aware re-weighting module is essential for achieving high-fidelity clustering.
By harmonizing cell-level contrast with cluster-level guidance, scRGCL provides a robust and scalable framework that advances the precision of automated cell-type discovery in increasingly complex single-cell landscapes.
Key Messages:
- scRGCL uses contrastive learning on a regularized representation for single-cell clustering.
- scRGCL outperforms four state-of-the-art methods on 15 datasets.
- scRGCL's cluster-aware negative sampling and neighbor-aware re-weighting modules are essential for high-fidelity single-cell clustering.
Shur, A.; Tziony, I.; Orenstein, Y.
Minimizers are sampling schemes that are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding-window algorithm that chooses in each window of length w + k − 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred, as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require explicitly storing k-mer ranks in Ω(2^k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizer was proven to guarantee lower density compared to a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k − 2, a random 10-minimizer has, in expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers.
Notably, we are the first to benchmark constant-space minimizers by the time spent on k-mer key retrieval, which is the most fundamental operation in many minimizer-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, less than required by random minimizers) for all practical values of k and w. We expect 10-minimizers to improve minimizer-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.
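The sliding-window scheme defined above can be sketched directly. Note that this uses plain lexicographic order as a stand-in for the order ρ, so it is a generic minimizer, not a 10-minimizer or a spacer:

```python
def minimizer_positions(seq, k, w):
    """Positions of k-mers selected by a (k, w) minimizer under plain
    lexicographic order (a stand-in for the order rho in the abstract)."""
    chosen = set()
    n_kmers = len(seq) - k + 1
    for start in range(n_kmers - w + 1):
        # Each window spans w consecutive k-mers, i.e. w + k - 1 characters;
        # the lexicographically minimal k-mer in the window is selected.
        chosen.add(min(range(start, start + w), key=lambda i: seq[i:i + k]))
    return chosen

seq = "ACGTACGTTGACCATGA"
k, w = 3, 4
positions = minimizer_positions(seq, k, w)
# Density = fraction of all k-mers that were ever selected; lower is better.
density = len(positions) / (len(seq) - k + 1)
```

Because every window must contain a selected k-mer, consecutive selected positions are at most w apart, which is the windowing guarantee minimizer-based methods rely on.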
Epain, V.; Mane, A.; Della Vedova, G.; Bonizzoni, P.; Chauve, C.
We address the problem of plasmid binning, which aims to group contigs from a draft short-read assembly of a bacterial sample into bins, each expected to correspond to a plasmid present in the sequenced bacterial genome. We formulate the plasmid binning problem as a network multi-flow problem in the assembly graph and describe a Mixed-Integer Linear Program to solve it. We compare our new method, PlasBin-HMF, with state-of-the-art methods, MOB-recon, gplasCC, and PlasBin-flow, on a dataset of more than 500 bacterial samples, and show that PlasBin-HMF outperforms the other methods while preserving explainability.
Pan, X.; Shrawat, A.; Raghavan, S.; Dong, C.; Yang, Y.; Li, Z.; Zheng, W. J.; Eckhardt, S. G.; Wu, E.; Fuxman Bass, J. I.; Jarosz, D. F.; Chen, S.; McGrail, D. J.; Sheynkman, G. M.; Huang, J. H.; Sahni, N.; Yi, S. S.
Most proteins exert their functions in complex with other interactors. Single mutations can exhibit a profound impact on perturbing protein interactions, leading to human disease. However, predicting the effect of single mutations on protein interactions remains a major computational challenge. Deep learning, particularly protein language models or transformers, has become an effective tool in bioinformatics for protein structure prediction. However, the functional divergence of mutations makes it difficult to predict their interaction perturbation profiles. To address this fundamental challenge, we present eSIG-Net (edgetic mutation Sequence-based Interaction Grammar Network), a novel sequence-based "Interaction Language Model" for predicting protein interaction alterations caused by single mutations. eSIG-Net combines various protein sequence embeddings, introduces a mutation-encoding module with syntax and evolutionary insights, and employs contrastive learning to evaluate mutation-induced interaction changes. eSIG-Net significantly outperforms current state-of-the-art sequence-based and structure-based prediction methods at predicting mutational impact on protein interactions. We highlight examples where eSIG-Net nominates causal variants with high confidence and elucidates their functional role under relevant biological contexts. Together, eSIG-Net is a first-in-kind "interaction language model" that can accurately predict interaction-specific rewiring by single mutations with only sequence information, and exhibits generalizability across biological contexts.
Feng, Y.; Deng, K.; Guan, Y.
Gene networks (GNs) encode diverse molecular relationships and are central to interpreting cellular function and disease. The heterogeneity of interaction types has led to computational methods specialized for particular network contexts. Large language models (LLMs) offer a unified, language-based formulation of GN inference by leveraging biological knowledge from large-scale text corpora, yet their effectiveness remains sensitive to prompt design. Here, we introduce Gene-Relation Adaptive Soft Prompt (GRASP), a parameter-efficient and trainable framework that conditions inference on each gene pair through only three virtual tokens. Using factorized gene-specific and relation-aware components, GRASP learns to map each pair's biological context into compact soft prompts that combine pair-specific signals with shared interaction patterns. Across diverse GN inference tasks, GRASP consistently outperforms alternative prompting strategies. It also shows a stronger ability to recover unannotated interactions from synthetic negative sets, suggesting its capacity to identify biologically meaningful relationships beyond existing databases. Together, these results establish GRASP as a scalable and generalizable prompting framework for LLM-based GN inference.
Zhang, L.; Wang, L.; Sun, X.; Tang, W.; Su, H.; Qian, Y.; Yang, Q.; Li, Q.; Tang, Z.; Sun, H.; Han, Y.; Jiang, Y.; Lou, W.; Zhou, B.; Wang, X.; Bai, L.; Xie, Z.
Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.
Lourenco, V. M.; Ogutu, J. O.; Piepho, H.-P.
Data contamination, from recording errors to extreme outliers, can compromise statistical models by biasing predictions, inflating prediction errors, and, in severe cases, destabilizing performance in high-dimensional settings. Although contamination can affect responses and covariates, we focus on response contamination and evaluate Random Forests through simulation. Using a synthetic animal-breeding dataset, we assess robust Random Forests across several contamination scenarios and validate them on plant and animal datasets. We thereby clarify the consequences of contamination for prediction, develop a robust Random Forest framework, and evaluate its performance. We examine preprocessing or data-transformation strategies, algorithmic modifications, and hybrid approaches for robustifying Random Forests. Across these approaches, data transformation emerges as the most effective strategy, delivering the strongest performance under contamination. This strategy is simple, general, and transferable to other Machine Learning methods, offering a remedy for robust genomic prediction. In real breeding data, robust Random Forests are useful when substantial contamination, phenotypic corruption, misrecording, or train-deployment mismatch is plausible and the goal is to recover a latent signal for genomic prediction and selection; ranking-based robust Random Forests are the dependable first option, whereas weighting-based Random Forests should be used only when their weighting scheme preserves rank structure and improves prediction. Robustification is not universally necessary, but it becomes important when contamination distorts the link between observed responses and the predictive target; standard Random Forests remain the default for clean data, whereas robust Random Forests should be fitted alongside them whenever contamination is plausible, with the final choice guided by data, trait, and breeding objective.
Author summary: Machine learning (ML) methods are widely used for prediction with high-dimensional, complex data, and supervised approaches such as Random Forests (RF) have proved effective for genomic prediction (GP) and selection. Yet their performance can be severely compromised by data contamination if the algorithms rely on classical data-driven procedures that are sensitive to atypical observations. Robustifying ML methods is therefore important both for improving predictive performance under contamination and for guiding their practical use in high-dimensional prediction problems. To address this need, we develop robust preprocessing, algorithm-level, and hybrid strategies for improving RF performance with contaminated data. Using simulated animal data, we show that ranking- and weighting-based robust RF provide the strongest overall compromise for genomic prediction and selection under contamination. Validation on several plant and animal breeding datasets further shows that the benefits of robustification are not universal, but depend on the dataset, trait, and breeding objective. Although motivated by RF, the framework we propose is general, practical, and readily transferable to other ML methods. It also offers a basis for deciding when robustness should complement standard RF rather than replace it outright.
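The ranking-based transformation idea can be illustrated with a minimal sketch, assuming a simple normalized-rank preprocessing of the response (the authors' exact robustification may differ):

```python
def rank_transform(y):
    """Replace raw responses by their normalized ranks in (0, 1].
    An extreme outlier keeps only its rank, so its leverage on any
    downstream learner (e.g., a Random Forest) is capped."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    ranks = [0.0] * len(y)
    for r, i in enumerate(order):
        ranks[i] = (r + 1) / len(y)
    return ranks

y = [1.2, 0.8, 1.1, 250.0, 0.9]       # one grossly contaminated response
print(rank_transform(y))              # → [0.8, 0.2, 0.6, 1.0, 0.4]
```

The contaminated value 250.0 is mapped to 1.0, the same value it would have received had it been a mild maximum, which is exactly the rank-structure preservation the abstract argues for.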
Ulusoy, E.; Bostanci, S.; Deniz, B. E.; Dogan, T.
Motivation: Molecular representation learning is central to computational drug discovery. However, most existing models rely on single-modality inputs, such as molecular sequences or graphs, which capture only limited aspects of molecular behaviour. Unifying these modalities with complementary resources such as textual descriptions and biological interaction networks into a coherent multimodal framework remains non-trivial, hindering more informative and biologically grounded representations.
Results: We introduce SELFormerMM, a multimodal molecular representation learning framework that integrates SELFIES notations with structural graphs, textual descriptions, and knowledge graph-derived biological interaction data. By aligning these heterogeneous views, SELFormerMM effectively captures complementary signals that unimodal approaches often overlook. Our performance evaluation has revealed that SELFormerMM outperforms structure-, sequence-, and knowledge-based models on multiple molecular property prediction tasks. Ablation analyses further indicate that effective cross-modal alignment and modality coverage improve the model's ability to exploit complementary information. Overall, integrating SELFIES with structural, textual, and biological context enables richer molecular representations and provides a promising framework for hypothesis-driven drug discovery.
Availability: SELFormerMM is available as a programmatic tool, together with datasets, pretrained models, and precomputed embeddings, at https://github.com/HUBioDataLab/SELFormerMM
Contact: tuncadogan@gmail.com
Frost, H. R.
We describe an approach for analyzing biological networks using rows of the Krylov subspace of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows of the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. elegans neural network.
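The power-iteration construction described above can be sketched in pure Python. The toy graph and seed vector below are illustrative assumptions, not the C. elegans network:

```python
def matvec(A, v):
    """Multiply a dense adjacency matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def krylov_trajectories(A, v0, m):
    """Build the Krylov columns [v0, A v0, ..., A^m v0] by repeated
    multiplication; row i (node i's trajectory) is returned per node."""
    cols = [v0]
    for _ in range(m):
        cols.append(matvec(A, cols[-1]))
    return [[col[i] for col in cols] for i in range(len(v0))]

# Two triangles (nodes 0-2 and 3-5) joined by the bridge edge 2-3.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
v0 = [1, 0, 0, 0, 0, 0]   # a "perturbation" seeded at node 0
traj = krylov_trajectories(A, v0, m=3)
```

Nodes playing the same structural role relative to the seed (here, nodes 4 and 5) receive identical trajectories, which is the signal that can be exploited for community detection and perturbation analysis.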
Garcia, J. J.; Yu, K. M.; Freudenreich, C. H.; Cowen, L.
In baker's yeast, there exists a comprehensive collection of pairwise epistasis experiments that, for nearly every pair of non-essential genes, measures the growth of the double-knockout strain as compared to its component single knockouts. This data can be represented as a weighted signed graph termed the genetic interaction network, and we introduce a new ILP-based method named GIDEON to search for a diverse collection of Between-Pathway Models (BPMs) in this network, where BPMs are a graph motif signature that indicates potential compensatory pathways in the genetic interaction network. With both an improved distribution-informed edge weighting scheme and an improved ILP method, GIDEON produces BPM collections that are substantially larger and have better functional enrichment than those of previous methods. We find some interesting new BPM gene sets, including one with potential insights into antifungal drug targets through ties between ergosterol and aromatic amino acid biosynthesis.